Abstract

We present a background model that differentiates between background motion and
foreground objects. Unlike most models that represent the variability of pixel intensity
at a particular location in the image, we model the underlying warping of pixel locations
arising from background motion. The background is modeled as a set of warping layers,
where at any given time, different layers may be visible due to the motion of an occluding
layer. Foreground regions are thus defined as those that cannot be modeled by some
composition of some warping of these background layers. We illustrate this concept by
first reducing the possible warps to those where the pixels are restricted to displacements
within a spatial neighborhood, and then learning the appropriate size of that spatial
neighborhood. Then, we show how changes in intensity/color histograms of pixel
neighborhoods can be used to discriminate foreground and background regions. We find that
this approach compares favorably with the state of the art, while requiring less
computation.

1.  Introduction 

Background subtraction is a common pre-processing step for many vision tasks such as
object detection, localization, recognition, and categorization. In this context, a
(foreground) “object” is defined as a compact region of space that is “different” from
the background, and since the background is often modeled as a static map or distribution,
any background motion triggers the detection of a novel object. However, in environmental
monitoring scenarios, the background undergoes complex motions with self-occlusions that
challenge these models even when the camera is not moving. Natural environments, such as
the forest canopy, present a significant challenge because of the complex occlusion
structure and motion of foliage, and the rapid illumination changes due to transitions
between light and shadow (also an occlusion phenomenon). Clearly, representing or learning
an accurate model of the background is not a viable proposition. Instead, we present a
simple model that captures the phenomenology of background variations due to motion and
occlusions for the purpose of detecting foreground objects within.

We define as “background” the portions of the scene that, over relatively long
observation times, remain within the field of view, even though they may move and even
disappear temporarily due to partial occlusions. Therefore, we model the background as a
collection of layers (or “canonical images”) that can move (undergo domain deformations,
or “warpings”), and occlude each other to yield a generic background image. A foreground
region, or “object,” is thus another layer that cannot be obtained as a warping of a
canonical image. Allowing permutations of layers effectively tightens the background
distributions when there is significant background movement and intensity variation.

Unfortunately, finding the optimal unconstrained warping and layer combination that
yield a sample image would be computationally infeasible. We therefore evaluate a number
of possible techniques, each constraining the possible warping functions differently.
Because background motion tends to be small (consider foliage moving in the wind), we
limit the warping of image domains to a small spatial neighborhood, first heuristically,
then by learning the distribution from the data. We implement two different functions to
constrain this motion: a step function and a Gaussian window. When determination of the
particular pixel warping is irrelevant, we propose a different approach using blocks of
pixels. This “implicit” approach determines whether a patch in a sample image can be
generated by warping a similar patch in the prototype background images, without
representing such a warping explicitly. Using this latter technique, we obtain greater
accuracy in background/foreground labeling with faster computation time.

We use bird monitoring in natural habitats as a motivating application to bring
attention to a larger class of problems not previously addressed in the literature. A
number of pertinent questions about the impact of climate change on our ecosystem are
most readily answered by monitoring fine-scale interactions between animals and plants in
their environment. Such fine-scale measurements of species distribution, feeding habits,
and timing of plant blooming events require continuous monitoring, a task plagued by the
very challenges described in this paper.

2.  Prior  Work 

To detect novel objects in an image, many approaches model each pixel independently,
$p(I_t(x)) \sim r_x\ \forall t$. W4 [5] models $r_x$ using the variance found in a set of
background images with the maximum and minimum intensity values and the maximum
difference between consecutive frames. Pfinder [24] learns the mean and the variance of
pixel values at each location in the training set. If the mean and variance are all that
is known about a distribution, the most reasonable choice of distribution based on
maximal entropy is the Gaussian. The assumption then is $r_x = \mathcal{N}(\mu, \sigma^2)$,
and a likelihood model is used to classify background and foreground at each pixel.
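The single-Gaussian per-pixel model can be illustrated with a minimal sketch. It is not the Pfinder implementation: the function names, the threshold `k`, and the variance floor `min_var` are assumptions introduced here for illustration, and the likelihood test is reduced to a distance in standard deviations.

```python
import math


def fit_pixel_gaussians(frames):
    """Return per-pixel (mean, variance) grids from a list of equally sized 2-D frames."""
    n = len(frames)
    rows, cols = len(frames[0]), len(frames[0][0])
    mean = [[sum(f[r][c] for f in frames) / n for c in range(cols)]
            for r in range(rows)]
    var = [[sum((f[r][c] - mean[r][c]) ** 2 for f in frames) / n
            for c in range(cols)] for r in range(rows)]
    return mean, var


def label_foreground(frame, mean, var, k=2.5, min_var=1e-4):
    """Mark a pixel as foreground when it lies more than k standard deviations from its mean.

    k and min_var are illustrative assumptions, not values from the paper.
    """
    return [[abs(frame[r][c] - mean[r][c]) > k * math.sqrt(max(var[r][c], min_var))
             for c in range(len(frame[0]))] for r in range(len(frame))]


training = [[[0.5, 0.2]], [[0.5, 0.4]], [[0.5, 0.3]]]   # three 1x2 background frames
mean, var = fit_pixel_gaussians(training)
mask = label_foreground([[0.9, 0.35]], mean, var)       # first pixel deviates strongly
```

A pixel whose background intensity varies widely gets a large variance, so almost any value is accepted as background there, which is the weakness discussed for moving foliage.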

When this assumption does not adequately capture the distribution, a Mixture of
Gaussians (MoG) can be used [20, ' ] to further improve the accuracy of the estimate. A
MoG model, where $r_x = \sum_i w_i\, \mathcal{N}(\mu_i, \sigma_i^2)$, is capable of handling
a range of realistic scenarios, and thus is widely used [6, 21] to tackle the background
subtraction problem. Elgammal et al. [3] show it is possible to achieve greater accuracy
using a non-parametric model $r_x(i) = |T|^{-1} \sum_{t \in T} K(I_t(x) - i)$, where $K$
is a kernel function and $i$ spans the range of possible pixel values at the pixel $x$.
Another contribution of this work is the incorporation of spatial constraints into the
formulation of foreground classification. In the second phase of this approach, pixel
values that can be explained by distributions of neighboring pixels are reclassified as
background, allowing for greater resilience against dynamic backgrounds. Sheikh and Shah
unify the temporal and spatial consistencies into a single model [18]. Similar models
include [13, 16, 17]. Looking at the statistics at a single pixel, shown in the center
panel of Fig. 1, we see that the distribution of background pixels spans almost all
grayscale intensities and largely overlaps the foreground distribution. Per-pixel models
therefore produce many false positives or misses.

A different approach, taken by Oliver et al. [15], looks at global statistics rather
than local constraints. Similar to eigenfaces, a small number of “eigen-backgrounds” are
created to capture the dominant variability of the background. The assumption is that the
remaining variability in an image is due to foreground objects. The “eigen-background”
approach works well for global changes in the background, such as variable illumination,
but does not work well when the variability is local. If there are small changing regions
in the background, as is the case in natural environments, the intensities of pixels A
and B in Fig. 1 do not correlate, making “eigen-backgrounds” a poor model.

Yet another approach assumes that a background pixel is generated with a distribution
that is based on its history, $I_t(x) \sim r_x(I_{t-1}(x))$. The simplest of these models,
frame differencing [ ], thresholds the difference between two frames of a sequence, and
large changes are considered foreground. To resolve ambiguity due to slowly moving
objects, Kameda and Minoh [8] use a “double difference” that classifies foreground as a
logical “AND” of the pair-wise differences between three consecutive frames. A compromise
between differencing neighboring frames and differencing against a known background image
is to adapt the background over time by incrementally incorporating the current image
into the background. Migliore et al. [12] integrate frame differencing and background
modeling to improve overall performance.
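The double-difference idea can be sketched as follows, assuming the logical combination is a conjunction of the two thresholded pair-wise differences. The function names and the threshold `tau` are hypothetical and are not taken from [8].

```python
def thresholded_difference(a, b, tau):
    """Boolean map of pixels whose absolute change between two frames exceeds tau."""
    return [[abs(pa - pb) > tau for pa, pb in zip(row_a, row_b)]
            for row_a, row_b in zip(a, b)]


def double_difference(prev, curr, nxt, tau=0.1):
    """Foreground where the current frame differs from both its neighbours in time."""
    d1 = thresholded_difference(prev, curr, tau)
    d2 = thresholded_difference(curr, nxt, tau)
    return [[x and y for x, y in zip(r1, r2)] for r1, r2 in zip(d1, d2)]


prev = [[0.0, 0.0, 0.0]]
curr = [[0.0, 1.0, 0.0]]   # object present at the middle pixel
nxt = [[0.0, 0.0, 1.0]]    # object has moved one pixel to the right
mask = double_difference(prev, curr, nxt)
```

Only the middle pixel is flagged: it is the only location that changed in both pairs, so the trailing "ghost" that plain frame differencing leaves behind a moving object is suppressed.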

Rather than implicitly modeling the background dynamics, many approaches have
explicitly modeled the background as composed of dynamic textures [2]. Wallflower [22]
uses a Wiener filter to predict the expected pixel value based on the set of past samples
whose coefficients $a$ are learned. Monnet et al. [14] model the background as a dynamic
texture [19], where the first few principal components of the variance of a set of
background images (similar to [15]) comprise an autoregressive model in the same vein as
[22, 9]. As shown in Fig. 1, pixels do not change in a predictable way over time, making
dynamical models a poor fit for representing the background.

3.  Approach 

Our goal is to model the “usual” pixel values for background and detect the “unusual”
pixels in image sequences captured from a fixed camera. We assume we start with a small
number of training images, $\mathcal{T} = \{I_t(x) : t = 1, \ldots, T;\ x \in \Omega\}$,
where $I : \Omega \subset \mathbb{R}^2 \to \mathbb{R}^+;\ x \mapsto I_t(x)$, and the goal
is to label all pixels in any image in the sequence, $I_t(x)$, $\forall t > T$.

We assume that these images can be constructed from a canonical image $I_0$ through some
warping of the domain in $I_0$. That is,

$$I_t(x) = I_0(w_t(x)), \quad (1)$$

where $w_t \sim q$. The warping, $w_t$, is drawn from some displacement distribution $q$
independently, so at any time $t$, any warping can be selected.

This model is valid only away from occlusions $O \subset \Omega$ [1]. At occluded
regions, a different scene $I_1$ is visible that, in general, has no relation with $I_0$.
More generally, there can be an arbitrary number of occlusion layers, any of which can
become visible at a given instant in time. Thus, rather than using a generative model
derived from a single warped image, we can model the composition of several layers [23],
each warped independently. A sample image is constructed by selecting the best warping
from each canonical image, or

Figure 1. (Left) An image from a sample sequence. (Center) The distribution of pixel values at location A, separated into background
(above) and foreground (below) pixels. Because of the range of background pixel intensities and the overlap of foreground and background
distributions, modeling pixels individually would result in poor classification performance. (Right) The intensity variations of pixels A and
B are not correlated over time, indicating that “eigen-background”-based methods do not capture these background changes.


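layer, $I_b$, such that

$$I_t(x) = \sum_{b=1}^{B} I_b(w_{t,b}(x))\, \chi_{b,t}(x), \quad (2)$$

where

$$\chi_{b,t}(x) = \begin{cases} 1 & \text{if layer } b \text{ is visible at } x, \\ 0 & \text{otherwise.} \end{cases}$$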

But for an observed image $I_t$, we do not know whether it contains foreground objects,
how the background is warped (unknown $w_{t,b}(x)$), or where the occlusions occur
(unknown $\chi_{b,t}(x)$). The pixel-wise discrepancy is thus

$$D_t(x) = \Big\| I_t(x) - \sum_{b=1}^{B} I_b(\hat{w}_{t,b}(x))\, \hat{\chi}_{b,t}(x) \Big\|_2, \quad (3)$$

where $\hat{w}_{t,b}(x)$ is the estimated warp and $\hat{\chi}_{b,t}(x)$ is the estimated
occlusion map. Depending on the application’s tolerance for false positives versus missed
detections, a threshold can be applied to the difference image for segmentation. The
focus for the rest of this section is to model $q$ adequately and in a computationally
feasible way.
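To make Eqs. (2) and (3) concrete, the following minimal sketch composes a one-dimensional background from warped layers and thresholds the residual. The data layout (lists indexed by pixel, one visible layer index per pixel) and the threshold value are assumptions for illustration; the warps and the occlusion map are taken as given here rather than estimated.

```python
def compose_background(layers, warps, visible_layer):
    """Eq. (2) in 1-D: at each pixel x, read the visible layer at its warped location."""
    return [layers[b][warps[b][x]] for x, b in enumerate(visible_layer)]


def discrepancy(observed, layers, warps, visible_layer):
    """Eq. (3) for scalar intensities: absolute residual against the composed background."""
    background = compose_background(layers, warps, visible_layer)
    return [abs(o - g) for o, g in zip(observed, background)]


layers = [[0.1, 0.2, 0.3, 0.4],      # first canonical image
          [0.9, 0.8, 0.7, 0.6]]      # second canonical image
warps = [[1, 1, 2, 3],               # warped source location per pixel, first layer
         [0, 1, 2, 2]]               # warped source location per pixel, second layer
visible_layer = [0, 0, 1, 1]         # index b for which chi_{b,t}(x) = 1
observed = [0.2, 0.2, 0.7, 0.1]      # last pixel is not explained by the visible layer

D = discrepancy(observed, layers, warps, visible_layer)
foreground = [d > 0.3 for d in D]    # threshold chosen for illustration only
```

Raising or lowering the threshold trades false positives against missed detections, which is how the precision-recall behavior of the method would be swept.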


3.1.  Modeling  warp

In practice, not all warpings are plausible. Rather than allowing arbitrary warpings, we
limit a pixel’s possible warpings, $Q$, to its spatial neighborhood, where $q$ is in the
set $Q$, and

$$q(x) \sim \mathrm{Unif}[x - \Delta x,\ x + \Delta x]. \quad (4)$$

We select a warping, $w_t$, for $I_t$ in a greedy fashion. For each pixel $x$, we find the
best warping $\hat{w}_{t,b}(x)$, restricted to pixels not previously warped from the
canonical image $I_b$ and to the square neighborhood specified by $\Delta x$. The “best”
warping for each $I_b$ is defined by the similarity of its appearance,

$$\hat{w}_{t,b}(x) = \operatorname*{argmin}_{w_{t,b}(x) \in Q \setminus \bar{Q}} \| I_t(x) - I_b(w_{t,b}(x)) \|_2, \quad (5)$$

where $\bar{Q}$ is the set of $q$’s that refer to previously matched $x$’s. We then use the
best warping from the set of canonical images as the unoccluded region,

$$\hat{\chi}_{b,t}(x) = \begin{cases} 1 & \text{if } \operatorname*{argmin}_{b'=1,\ldots,B} \| I_t(x) - I_{b'}(\hat{w}_{t,b'}(x)) \|_2 = b, \\ 0 & \text{otherwise.} \end{cases} \quad (6)$$

This formulation can result in having no possible warp for a particular $I_b(x)$, that
is, $Q = \bar{Q}$. In this case, since there are no pixels that can be matched, we assume
the pixel is foreground, or an occlusion not modeled by the selected $I_b$.

Using a uniform distribution around the pixel can result in poor matches when performing
greedy matching. Since it is more likely that pixels are only warped slightly, we would
like to bias our selected warps to those with minimal distance from the original
location. To do this, we augment our minimization to

$$\hat{w}_{t,b}(x) = \operatorname*{argmin}_{w_{t,b}(x) \in Q \setminus \bar{Q}} g_\sigma(x - w_{t,b}(x))\, \| I_t(x) - I_b(w_{t,b}(x)) \|_2, \quad (7)$$

where $g_\sigma(x) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-x^2/(2\sigma^2)}$.

Both approaches, using the uniform distribution and the Gaussian distribution, have
parameters that can be learned from the data. In the uniform case, we can learn the
appropriate $\Delta x$ for each pixel $x$. For the Gaussian distribution, we can learn the
appropriate $\sigma^2$.
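The greedy matching of Eq. (5) can be sketched in one dimension for a single canonical image. This is an illustrative simplification, not the authors' code: the raster scan order, the scalar intensity feature, and the names `greedy_warp` and `used` are assumptions made here.

```python
def greedy_warp(observed, canonical, dx):
    """Greedy 1-D version of Eq. (5) with a uniform window of radius dx.

    Returns (warp, cost). warp[x] is the matched canonical index, or None when
    every candidate in the window has already been used by an earlier pixel.
    """
    used = set()                       # canonical positions already matched
    warp, cost = [], []
    for x, value in enumerate(observed):
        lo = max(0, x - dx)
        hi = min(len(canonical) - 1, x + dx)
        candidates = [q for q in range(lo, hi + 1) if q not in used]
        if not candidates:             # no warp left: treat as foreground or occlusion
            warp.append(None)
            cost.append(float("inf"))
            continue
        best = min(candidates, key=lambda q: abs(value - canonical[q]))
        used.add(best)
        warp.append(best)
        cost.append(abs(value - canonical[best]))
    return warp, cost


canonical = [0.1, 0.5, 0.9, 0.3]
observed = [0.5, 0.1, 0.3, 0.9]        # neighbouring pixels swapped by background motion
warp, cost = greedy_warp(observed, canonical, dx=1)
```

With `dx=1` every pixel finds its displaced counterpart at zero cost. With a window that is too small, or when earlier pixels have consumed all candidates, `warp[x]` is `None`, which corresponds to the unmatched case described for the uniform window.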


3.2.  Implicit  warping 

In reality, we are not interested in the precise warping of canonical images to the
sample image. Often, it is enough to know where the warping model fails, indicating that
foreground objects are present. Given the assumption that background motion is local, we
can estimate how closely a patch of pixels in $I_b$ matches the corresponding patch of
$I_t$ by measuring the distance between the distributions of pixel values in the two
patches, instead of comparing individual pixels.


Figure 2. The top row shows the resulting difference per pixel using intensity as the feature, and the bottom row, using YUV as the feature.
Left to right: 1) Raw image, $I_t$. 2-5) Difference image when $\Delta x$ = 1, 5, 9, 15 respectively. Black pixels indicate a perfect match, white
indicates no match, and the grays in between represent the “goodness” of the warp assigned.


We redefine the distance from the background according to

$$D_t(x) = \min_{b=1,\ldots,B} d(h_{x,I_t}, h_{x,I_b}), \quad (8)$$

where $d(x, y) = 1 - \sum_i \sqrt{x_i y_i}$ is one minus the Bhattacharyya coefficient, and
the histogram is

$$h_{x,I}(y) = \frac{1}{w^2} \sum_{j \in J} \delta(I(j) - y). \quad (9)$$

$J$ is limited to the spatial neighborhood of $x$, so that
$J = \{j : x - w/2 \le j < x + w/2\}$. The histogram $h$ is defined over the range of the
image, $y \in [0, 1]$. A Gaussian blur is used to smooth away the artificial edges induced
by restricting subimages to non-overlapping blocks of the image.
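The patch-histogram distance of Eqs. (8) and (9) can be sketched as below. The number of bins, the indexing of a patch by its top-left corner, and the function names are assumptions for illustration; the Gaussian blur step is omitted.

```python
import math


def patch_histogram(image, top, left, w, bins=8):
    """Normalized histogram of the w-by-w patch at (top, left); pixel values lie in [0, 1]."""
    hist = [0.0] * bins
    for r in range(top, top + w):
        for c in range(left, left + w):
            b = min(int(image[r][c] * bins), bins - 1)
            hist[b] += 1.0 / (w * w)
    return hist


def histogram_distance(h1, h2):
    """d(x, y) = 1 - sum_i sqrt(x_i * y_i) for two normalized histograms."""
    return 1.0 - sum(math.sqrt(a * b) for a, b in zip(h1, h2))


def implicit_distance(sample, canonicals, top, left, w, bins=8):
    """Eq. (8): smallest histogram distance between the sample patch and any canonical patch."""
    hs = patch_histogram(sample, top, left, w, bins)
    return min(histogram_distance(hs, patch_histogram(c, top, left, w, bins))
               for c in canonicals)


canonical = [[0.1, 0.9], [0.9, 0.1]]
moved = [[0.9, 0.1], [0.1, 0.9]]       # same content, displaced within the patch
novel = [[0.5, 0.5], [0.5, 0.5]]       # intensities never seen in the canonical patch

d_moved = implicit_distance(moved, [canonical], 0, 0, 2)
d_novel = implicit_distance(novel, [canonical], 0, 0, 2)
```

Because a histogram discards pixel positions, any rearrangement of pixels inside the patch leaves the distance near zero, so motion within a patch is absorbed without estimating a warp, while novel intensities push the distance toward one.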

4.  Results 

Detailed experiments are run on a 200-frame image sequence of birds at a feeder station
from the data set released in [10]. We use 100 images for training and 100 images for
testing. As this dataset has very few images of clean backgrounds, we use a ground-truth
labeling of foreground/background pixels to exclude those foreground pixels from the
training set of background images.

4.1.  Basic  Warp 

We select five bird-less images from our training set as our canonical backgrounds for
the following experiments. We start by using a single canonical image, and vary the
neighborhood in which we search for a warping match. Fig. 4 indicates that performance is
hardly affected by different $\Delta x$ values, whether we use the grayscale intensity as
our feature, or color (in the YUV space). We compare these results to Elgammal’s
approach, and find that warping performs significantly worse. A closer look at the
results shown in Fig. 2 indicates that the cause for such failure is the inaccuracy of
the estimated warping. The bright white spots indicate pixels that could not be matched
to the base image. As we increase $\Delta x$, shown consecutively from left to right, we
see that more and more pixels are matched, but a significant number of pixels remain
unmatched.

Figure 3. The top row shows the resulting difference images when using intensities as
the feature, and the bottom row, using YUV as the feature, as we increase the number of
canonical images used. From left to right, we show $B$ = 1, 3, 5, respectively. The
addition of a single canonical image greatly reduces the number of unmatched pixels.
There is little visible difference between $B$ = 3 or 5, indicating that most occlusions
are handled in the first 2 or 3 canonical images.

Figure 4. Restricting a pixel’s warp to its spatial neighborhood results in rather poor
performance, even as the radius of the neighborhood, $\Delta x$, is increased, regardless
of the feature used (grayscale on the left, YUV on the right). The performance is
significantly worse than Elgammal’s approach, shown in black.
Figure 5. Occlusions are accounted for by using multiple base canonical images (where
$B$ is the number of images used). We see a significant performance improvement, as well
as cleaner results than Elgammal’s approach, for both gray (left) and YUV (right).

Figure 9. Using a Gaussian filter improves the performance as compared to Fig. 4. As
expected, increasing the variance of the Gaussian window improves performance in both
intensity feature space (left) and YUV (right).

Figure 7. Learned step width, $\Delta x$, for each pixel for intensity (YUV yielded
similar results). As expected, regions that are fairly stable, such as the feeder
platform, have smaller step sizes (indicated by the darker color). Learning the
appropriate $\Delta x$ for each pixel maintains similar performance while reducing
computation by half, as shown in the bar graphs.

Figure 8. Learning $\Delta x$ for each pixel results in very similar performance as
compared to a fixed $\Delta x = 15$, regardless of the number of canonical images, $B$.
$B = 1$ and $B = 5$ are shown here.


It is reasonable that many pixels are not matched, due to the large amount of
background movement occurring in the sequence. Increasing the number of canonical images
used, shown in Fig. 5, overcomes the problem of unmatched pixels, indicating that
occlusion was indeed the limiting factor. As we increase the number of canonical images,
$B$, we end up outperforming Elgammal’s approach, in both the grayscale and YUV feature
space. As indicated in Fig. 3, there are far fewer failed matches as we increase $B$.

Rather than defining a fixed $\Delta x$, we attempt to learn the appropriate $\Delta x$
for each pixel to reduce computation. We start off by seeding the training algorithm with
a manually selected canonical image. In this case, we use the first image in the
sequence. Then, for each canonical image, we compute the best warp to each of the
remaining training images. We discard warps to and from pixels that belong to foreground
objects. We then select the next canonical image by choosing the training image that has
the most unmatched pixels. Implicit in this choice is the assumption that pixels that
cannot be warped correspond to occluded pixels. Therefore, we select the image that
reveals the most of the background that was occluded in the previously selected canonical
images. We allow warping up to 15 pixels in any direction.
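The selection procedure above can be sketched as a greedy loop over one-dimensional training images. Several details are assumptions made only for this illustration: a pixel counts as "unmatched" when no canonical pixel within the window agrees to within a tolerance `eps`, the one-to-one constraint of the greedy warp is dropped for brevity, and the removal of ground-truth foreground pixels is omitted.

```python
def unmatched_count(image, canonicals, dx, eps=0.05):
    """Count pixels of a 1-D image that no canonical image explains within +/- dx."""
    count = 0
    for x, value in enumerate(image):
        lo = max(0, x - dx)
        hi = min(len(image) - 1, x + dx)
        explained = any(abs(value - canon[q]) <= eps
                        for canon in canonicals for q in range(lo, hi + 1))
        if not explained:
            count += 1
    return count


def select_canonicals(training, num_layers, dx, eps=0.05):
    """Seed with the first image, then repeatedly add the image with the most unmatched pixels."""
    chosen = [0]
    while len(chosen) < num_layers:
        remaining = [i for i in range(len(training)) if i not in chosen]
        if not remaining:
            break
        canon = [training[i] for i in chosen]
        worst = max(remaining,
                    key=lambda i: unmatched_count(training[i], canon, dx, eps))
        chosen.append(worst)
    return chosen


training = [
    [0.1, 0.1, 0.1, 0.1],   # seed image
    [0.1, 0.1, 0.9, 0.1],   # one pixel of a hidden layer revealed
    [0.9, 0.9, 0.9, 0.1],   # most of the hidden layer revealed
    [0.1, 0.1, 0.1, 0.1],
]
chosen = select_canonicals(training, num_layers=2, dx=1)
```

Here the third image is selected as the second canonical image because it exposes the most background that the seed image cannot explain; once it is included, the remaining images have no unmatched pixels.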

Fig. 7 shows the range estimated by this method. The left image shows the learned
range, where white pixels indicate the full ±15 pixels and black pixels indicate a warp
range of 0. This learned range requires half as many computations to estimate the warp
as a fixed $\Delta x = 15$, while maintaining similar performance, as shown in Fig. 8. We
show precision-recall curves for both $B = 1$ and $B = 5$. The rate of improvement to the
precision-recall curve decreases as we go past 3 canonical images, indicating that most
occluded backgrounds are modeled in the first 3 canonical images. This confirms our
intuition that a few layers (leaves, feeder stations, sky) are sufficient to capture the
phenomenology of the data.

4.2.  Using  a  Gaussian  window 

We weight possible warpings to regularize our matching scheme, making the resulting
warp less sensitive to local minima. We test with several values of $\sigma^2$ and find
that this greatly improves the performance as compared to the uniform warp shown in
Fig. 4. Fig. 9 shows that performance improves as we increase $\sigma^2$. With a large
enough $\sigma^2$, we approach the accuracy of Elgammal’s approach. Fig. 6 shows a
smoother difference image, with few unmatched pixels, in both grayscale and YUV.

Adding base images (the same ones as used previously) results in improved performance.
Similar to when a uniform warping was used, we achieve better performance as we increase
$B$, as shown in Fig. 10.


Figure 6. Resulting difference images using a Gaussian filter when selecting the best warping function, where the top row uses intensity
as the feature, and the bottom row uses YUV. Left to right: 1) Raw image, $I_t$. 2-5) Difference image when $\sigma^2$ = 2, 10, 15, 30 respectively.
As we increase $\sigma^2$, we see the moving foliage fade quickly into the background.


Figure  10.  Occlusions  are  accounted  for  by  using  multiple  base 
canonical  images  (where  B  is  the  number  of  images  used),  using 
the  Gaussian  window.  This  improves  upon  Elgammal’s  approach 
for  both  grayscale  (left)  and  color  (right)  images. 


Figure 11. The $\sigma^2$ learned closely matches the underlying motion in the
background. Darker pixels (small $\sigma^2$) appear where the background is fairly
stationary, and lighter pixels (large $\sigma^2$) correspond to moving areas. The bar
chart on the right shows the size of the search space. Using the learned $\sigma^2$
results in a search space that is orders of magnitude smaller.


We follow the same procedure used to learn the step radius $\Delta x$ to learn the
$\sigma^2$ for each pixel, and find that the results mirror Fig. 8. There is little
performance loss but much greater computational efficiency, as shown in Fig. 11. The
total search space (for all pixels), as illustrated in the bar chart, is reduced by an
order of magnitude when the appropriate step size is learned.


Figure  12.  Using  the  implicit  warp  model,  the  increasing  patch 
width  w  reduces  false  positives,  by  enforcing  a  spatial  warping 
across  larger  areas  of  the  image. 
Figure 13. There is a tradeoff in selecting the appropriate block size, where $w$ is the
width of the block. As we increase $w$, we better handle background motion but we lose
some precision because our granularity is now at the block level. Also, the foreground
object contributes less to the distribution, resulting in less discrepancy between the
background block and the block that is part of the foreground.


4.3.  Implicit  warp 

We experiment with various patch widths, $w = \{4, 8, 16, 32\}$, and find that
increasing $w$ does not necessarily result in better performance, as shown in Fig. 13. A
closer inspection of the resulting difference images, Fig. 12, clarifies why this is so.
When there is background movement, a larger block accounts for larger motion from
background objects, such as the foliage of the tree. Yet, if the block is too large,
foreground objects only contribute to a small part of the overall distribution, resulting
in little change to $D_t(x)$.

Using multiple canonical images results in even better performance, as shown in Fig. 15.
Though the effect is not as dramatic as in the other cases, we see that large
displacements, such as the right feeder that swings back and forth, fade into the
background as we increase $B$, as shown in Fig. 14.

Figure 14. Adding canonical images accounts for the displacement of background objects.
Note how the right feeder fades into the background noise as the number of bases
increases.

Figure 15. Adding canonical images, $B$, to the implicit warping model improves
performance for both grayscale and color images.

4.4.  Comparison  to  Prior  Work 

We compare our implicit warping model approach to several other approaches described in
Section 2, and find that we mostly outperform the state of the art. Elgammal’s approach
[3] suffers because each background pixel exhibits a wide range of values, effectively
making all possible values background. Sheikh’s approach [18] is similar to our approach
because it captures the local spatial neighborhood. But because it requires a locally
consistent warping, it suffers from the same problem as Elgammal’s approach, that each
background pixel exhibits a wide range of values. Oliver’s approach [15] does not model
individual local motion, resulting in confused labeling where multiple motions occur.

This work builds on our previous work [10], but extends it by allowing multiple
background layers while reducing computational complexity. Computationally, the proposed
method is $O(B|I|)$, whereas our previous approach is $O(\Delta x^2 |I|)$. Since patches
are 30 × 30 pixels, computational savings can be one to two orders of magnitude. More
methods, including dynamic texture subtraction, were shown to perform poorly for this
data set in [10].

Looking at the average precision, our approach compares favorably on the image
sequences released in [11]. The first 100 frames of each sequence are used for training
and 20 images, labeled by the authors of [11], are used for testing. The “Hall,”
“Lobby,” and “Mall” image sequences contain people moving around indoor scenes that are
fairly static. We see that Elgammal’s and the explicit approach perform better than our
previous approach and the implicit approach, which has inherent smoothing. The remaining
sequences consist of fairly large background motion from the object that is the name of
the sequence (e.g., the “Trees” image sequence has moving trees in the background). Both
our explicit and implicit approaches outperform [3, 10] in these cases.

Figure 16. Our implicit warping model significantly outperforms approaches that model
each pixel independently [3], spatial-appearance models [18], and linear combination
approaches [15]. For the most part, it achieves better performance than [10], while
requiring much less computation.

Sequence      [3]       [10]      Explicit  Implicit
Classic Image Sequences
Hall          54.24%    22.13%    67.11%    40.64%
Lobby         11.66%     5.24%    13.00%     7.15%
Mall          68.56%    11.29%    62.24%    24.96%
Moving Background Image Sequences
Trees         36.89%    64.92%    51.83%    75.35%
Curtain       86.89%    69.48%    87.94%    94.09%
Escalator     62.43%    19.04%    64.55%    62.49%
Fountain      47.33%    49.62%    57.83%    71.21%
Water         90.68%    53.07%    93.08%    93.88%

Table 1. While our approaches (both explicit and implicit) result in higher average
precision than [3] and [10] in classic image sequences (people moving in indoor
environments), the greatest effect is seen in sequences with moving backgrounds.

5.  Conclusion 

We propose a warping model to account for the displacement of pixels in the background
image. We model the background as a set of canonical images to capture the different
layers of background that appear or become occluded as background objects move. We find
that the proposed approach better models the background in the case where there is
significant motion, as demonstrated on image sequences of birds at a feeder station and
on the more general sequences of [11]. Furthermore, the implicit warping model performs
better and requires less computation than the previous state of the art on this data set.
Figure 17. Sample difference images for the different approaches, from left to right: 1) Raw image. 2) Elgammal [3]. 3) Oliver [15]. 4)
Sheikh [18]. 5) Ko [10]. 6) Our implicit warping model. We see that our approach better handles the background motion compared to (2-4)
and is less blurred than (5). The relative difference between birds and the swinging feeder station is larger as well for our approach vs.
[10].